NBA Matches


Given the growing use of data science in the sports and betting markets, this week you will be part of a startup that wants to beat the NBA betting sites!

The online betting market was valued at US$85.047 in 2019 and may grow even further in the coming years, given the favorable stance of some governments toward legalizing the platforms and taxing them. [1]

With that in mind, your startup, RodaRodaBet, after an initial study of the American betting market and of the data available online about the NBA [2], wants to build a model that indicates whether the home team will win or lose in each round of the league.

In this challenge, you will use data scraped from the NBA & ABA League Index, which contains information on the teams playing in each round of the NBA, to predict whether a given home team will win or lose (Win or Lose).



import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df_test= pd.read_csv("test_without_label.csv")
df_train = pd.read_csv("train_full.csv")

Understanding the data

Because the variables are game statistics and we already have many of them, we chose not to derive new variables from them. Instead, we invested in the date variable. Day of the year turned out to be quite important for the models we tested, so we created further variations from the date, such as day of the week and day of the month.

Share of games won and lost

# strip whitespace from the column names
df_train.columns = df_train.columns.str.strip()
df_test.columns = df_test.columns.str.strip()
L    654
W    352
Name: WinOrLose, dtype: int64
y = df_train.WinOrLose.value_counts()/df_train.WinOrLose.value_counts().sum()  # relative frequency
plt.bar(['L','W'], y)
plt.title('Relative frequency of wins and losses')

Most games are home losses (654 of 1,006, about 65%), so the split must be stratified when training the models.
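
The imbalance noted above is the reason the splits later in the notebook pass `stratify=y`. As a minimal sketch of what stratification preserves, the snippet below uses a synthetic label vector with the same class counts; the single feature column is only a placeholder, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the training set
# (assumption: only the counts matter for this illustration).
y = np.array(["L"] * 654 + ["W"] * 352)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# With stratify=y, both parts keep roughly the same share of losses.
overall_loss_rate = np.mean(y == "L")
train_loss_rate = np.mean(y_tr == "L")
val_loss_rate = np.mean(y_val == "L")
print(round(overall_loss_rate, 3), round(train_loss_rate, 3), round(val_loss_rate, 3))
```

Without `stratify`, a random split could leave the validation part with a noticeably different win rate, which would make the metrics harder to compare.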

Game Data H_Team H_Wins H_Loss H_W/D % H_SRS H_Games H_TotalPoints H_AvgPointsPerGame ... A_TS% A_eFG% A_TOV% A_ORB% A_FT/FGA A_OeFG% A_OTOV% A_DRB% A_OFT/FGA WinOrLose
0 0 Thu, June 8 Miami Heat 52 30 0.634 3.59 82 8191 99.9 ... 0.550 0.495 13.1 31.8 0.285 0.475 13.7 72.2 0.257 L
1 1 Sun, June 11 Miami Heat 52 30 0.634 3.59 82 8191 99.9 ... 0.550 0.495 13.1 31.8 0.285 0.475 13.7 72.2 0.257 L
2 2 Tue, June 13 Dallas Mavericks 60 22 0.732 5.96 82 8130 99.1 ... 0.556 0.517 13.9 26.7 0.254 0.477 12.4 76.4 0.251 L
3 3 Thu, June 15 Dallas Mavericks 60 22 0.732 5.96 82 8130 99.1 ... 0.556 0.517 13.9 26.7 0.254 0.477 12.4 76.4 0.251 L
4 4 Sun, June 18 Dallas Mavericks 60 22 0.732 5.96 82 8130 99.1 ... 0.556 0.517 13.9 26.7 0.254 0.477 12.4 76.4 0.251 L

5 rows × 135 columns


Handling the dates

treino = df_train
teste  = df_test
from datetime import datetime

# The dates look like "Thu, June 8" and carry no year, so strptime assumes 1900.
def parse_data(s):
  return datetime.strptime(s, '%a, %B %d')

# Test set (assigning the whole column avoids chained .iloc assignment)
teste['Data'] = pd.to_datetime(teste['Data'].apply(parse_data), errors='raise')

# Training set
treino['Data'] = pd.to_datetime(treino['Data'].apply(parse_data), errors='raise')
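
Because the date strings carry no year, `strptime` falls back to 1900, and calendar-dependent features such as the weekday are then computed on the 1900 calendar. The sketch below makes this visible by writing the year out explicitly. The year 2006 is a hypothetical choice, picked only because June 8 falls on a Thursday in it; the source does not state the season year.

```python
from datetime import datetime

raw = "Thu, June 8"

# 1900 is the year strptime assumes when the format has no year directive;
# it is written out here so the effect is explicit.
as_1900 = datetime.strptime(raw + " 1900", "%a, %B %d %Y")

# Hypothetical season year, used only for comparison.
as_2006 = datetime.strptime(raw + " 2006", "%a, %B %d %Y")

# weekday(): Monday = 0. The parsed "Thu" is not used to compute it.
print(as_1900.weekday(), as_2006.weekday())
# Day of the year is the same in both, since neither year is a leap year.
print(as_1900.timetuple().tm_yday, as_2006.timetuple().tm_yday)
```

This suggests that the `weekday` feature built from these dates is a shifted version of the real game weekday rather than the weekday printed in the raw string, while `Dia` and `Dia do Ano` are unaffected outside leap years.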

Creating some date-based features

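# Day of the month (assumed right-hand side: the original was lost; the sample rows show Dia = 8 for June 8)
teste['Dia'] = teste.Data.dt.day
treino['Dia'] = treino.Data.dt.day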

teste['weekday'] = teste.Data.dt.weekday
treino['weekday'] = treino.Data.dt.weekday

teste['weekofyear'] = teste.Data.dt.isocalendar().week.astype(int)
treino['weekofyear'] = treino.Data.dt.isocalendar().week.astype(int)

teste['Dia do Ano'] = teste.Data.dt.dayofyear
treino['Dia do Ano'] = treino.Data.dt.dayofyear
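
As a small self-contained check of the date features, the sketch below builds them for two dates in 1900 (the year the year-less strings are parsed into) using `Series.dt.isocalendar().week`, the replacement pandas recommends for the deprecated `weekofyear`. The two dates are chosen to mirror the first sample rows; the `features` frame itself is only illustrative.

```python
import pandas as pd

# Two dates in the year 1900, mirroring how the year-less strings are parsed.
dates = pd.Series(pd.to_datetime(["1900-06-08", "1900-06-11"]))

features = pd.DataFrame({
    "Dia": dates.dt.day,                                    # day of the month
    "weekday": dates.dt.weekday,                            # Monday = 0
    "weekofyear": dates.dt.isocalendar().week.astype(int),  # ISO week number
    "Dia do Ano": dates.dt.dayofyear,
})
print(features)
```

`isocalendar().week` returns a nullable unsigned integer column, so the `astype(int)` cast keeps the feature as a plain integer like the others.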

Exploratory analysis (date variables)

vars_dias = ['Dia', 'weekday', 'weekofyear']

for coluna in vars_dias:
  x = treino[coluna]
  mu = round(x.mean(), 2)    # mean of the distribution
  sigma = round(x.std(), 2)  # standard deviation of the distribution

  f, (ax_box, ax_hist) = plt.subplots(2)

  sns.boxplot(x=x, ax=ax_box)
  sns.histplot(x=x, ax=ax_hist)

  sns.despine(ax=ax_box, left=True)
  # backslashes are escaped so that \mu and \sigma reach matplotlib's mathtext unchanged
  ax_box.set_title('Boxplot and histogram of {}\n $\\mu={}$, $\\sigma={}$'.format(coluna, mu, sigma))

weekofyear and Dia do Ano have similarly shaped distributions.

Bar charts (date variables)

visu = sns.catplot(x = 'weekday', data = treino, hue ='WinOrLose', kind = 'count', margin_titles = True)
visu = sns.catplot(x = 'weekofyear', data = treino, hue ='WinOrLose', kind = 'count', margin_titles = True)
visu = sns.catplot(x = 'Dia do Ano', data = treino, hue ='WinOrLose', kind = 'count', margin_titles = True)
visu = sns.catplot(x = 'Dia', data = treino, hue ='WinOrLose', kind = 'count', margin_titles = True)

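Creating the Season feature (season of the year: spring, summer, autumn and winter)

Keep in mind that the seasons in the USA (Northern Hemisphere) differ from those in the Southern Hemisphere. With the formula below, 1 = December to February (winter), 2 = March to May (spring), 3 = June to August (summer) and 4 = September to November (autumn).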

teste['Season'] = teste.Data.dt.month%12 // 3 + 1
treino['Season'] = treino.Data.dt.month%12 // 3 + 1

teste_total = teste.copy()
2    77
3    47
4    41
Name: Season, dtype: int64
2    924
3     82
Name: Season, dtype: int64
y = teste['Season'].value_counts()/teste['Season'].value_counts().sum()  # relative frequency
plt.bar(['2','3','4'], y)
plt.title('Relative frequency of Season in the test dataframe')

y = treino['Season'].value_counts()/treino['Season'].value_counts().sum()  # relative frequency
plt.bar(['2','3'], y)
plt.title('Relative frequency of Season in the training dataframe')

Note that games take place only in seasons 2, 3 and 4, and that the training set falls almost entirely in season 2 (924 of 1,006 games), which suggests this variable may not be very useful for the models.
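
For reference, the Season codes come from the expression `month % 12 // 3 + 1`. The sketch below applies the same arithmetic to plain integers; the helper name `season_code` is hypothetical and not part of the notebook.

```python
def season_code(month: int) -> int:
    """Map a month number (1-12) to the notebook's Season code (1-4)."""
    # % and // have the same precedence and associate left to right,
    # so this is ((month % 12) // 3) + 1.
    return month % 12 // 3 + 1

# 1 = Dec-Feb, 2 = Mar-May, 3 = Jun-Aug, 4 = Sep-Nov
for month in range(1, 13):
    print(month, season_code(month))
```

The `% 12` step is what moves December next to January and February, so the three winter months share code 1.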

Dropping the Game column

Id = teste.Game  # kept to build the submission file later
teste = teste.iloc[:,1:]

treino = treino.iloc[:,1:]

Dropping the Data (date) column

teste.drop('Data', axis=1, inplace= True)

treino.drop('Data', axis=1, inplace= True)

Converting object ('O') columns to int

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

# Training set: encode every object-typed column as integers
for c in treino.select_dtypes(include='object').columns:
  treino[c] = le.fit_transform(treino[c]).astype('int')

# Test set (the encoder is refit here, so codes only match the training
# set when both contain the same categories)
for c in teste.select_dtypes(include='object').columns:
  teste[c] = le.fit_transform(teste[c]).astype('int')
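
Refitting a `LabelEncoder` on each dataframe can assign different integers to the same category. The sketch below illustrates this with hypothetical team lists and shows one alternative, fitting once on the training values and reusing the encoder. It is an illustration of the idea, not the notebook's procedure.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical team names, used only for illustration.
train_teams = ["Miami Heat", "Dallas Mavericks", "Boston Celtics"]
test_teams = ["Miami Heat", "Dallas Mavericks"]

# Refitting on each set: codes depend on which names are present.
code_train = LabelEncoder().fit(train_teams).transform(["Miami Heat"])[0]
code_test_refit = LabelEncoder().fit(test_teams).transform(["Miami Heat"])[0]

# Fitting once on the training set and reusing the encoder keeps codes aligned.
shared = LabelEncoder().fit(train_teams)
code_test_shared = shared.transform(["Miami Heat"])[0]

print(code_train, code_test_refit, code_test_shared)
```

One caveat of the shared encoder: `transform` raises a `ValueError` for a category that was not seen during `fit`, so it only works when the test categories are a subset of the training ones.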

Feature Selection

Feature Importance (Random Forest)

#Divide the features into Independent and Dependent Variable
X = treino.drop('WinOrLose' , axis =1)
X_completo = X
y = treino['WinOrLose']
y_completo = y.copy()
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

colunas = X.columns
scaler_train = StandardScaler()
#scaler_train = MinMaxScaler()
X = scaler_train.fit_transform(X)

# No need to standardize the test set here, since we are only inspecting feature importances
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y, test_size=0.25)
model = RandomForestClassifier().fit(X_train, y_train)
def plot_feature_importance(importance, names, model_type):

    # Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    # Create a DataFrame using a dictionary
    data = {'feature_names': feature_names, 'feature_importance': feature_importance}
    fi_df = pd.DataFrame(data)

    # Sort the DataFrame in order of decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)

    # Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    # Add chart labels
    plt.title(model_type + 'FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')

plot_feature_importance(model.feature_importances_, colunas, 'Random Forest ')

Selecting k columns in order of importance

# Create arrays from feature importance and feature names
importance = model.feature_importances_
names = colunas

feature_importance = np.array(importance)
feature_names = np.array(names)

# Create a DataFrame using a dictionary
data = {'feature_names': feature_names, 'feature_importance': feature_importance}
fi_df = pd.DataFrame(data)

# Sort the DataFrame in order of decreasing feature importance
fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)
# Reset the index so the desired columns can be selected by position
fi_df.reset_index(drop=True, inplace=True)
# Select the desired number of columns, in order of importance
select_colunas = fi_df.feature_names[0:14]
 'Dia do Ano',
 'A_W/D %',
 'H_W/D %',
treino_completo = treino.copy()
teste_completo = teste.copy()

treino = treino[select_colunas]

teste = teste[select_colunas]
Dia Dia do Ano weekday weekofyear A_MOV A_SRS H_Wins A_W/D % A_FG% H_Loss H_eFG% H_TS% H_W/D % A_Loss
0 8 159 4 23 6.07 5.96 52 0.732 0.462 30 0.517 0.556 0.634 22
1 11 162 0 24 6.07 5.96 52 0.732 0.462 30 0.517 0.556 0.634 22
2 13 164 2 24 3.87 3.59 60 0.634 0.478 22 0.495 0.550 0.732 30
3 15 166 4 24 3.87 3.59 60 0.634 0.478 22 0.495 0.550 0.732 30
4 18 169 0 25 3.87 3.59 60 0.634 0.478 22 0.495 0.550 0.732 30
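
The ranking-and-selection step above can be written compactly with a `pandas.Series` indexed by column name. The sketch below runs on synthetic data from `make_classification` (an assumption standing in for the real training frame) and keeps the 14 features with the highest impurity-based importance.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data (assumption: 20 numeric features).
X_arr, y_demo = make_classification(n_samples=300, n_features=20,
                                    n_informative=5, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(20)])

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Rank features by impurity-based importance and keep the k highest.
k = 14
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
top_k = importances.sort_values(ascending=False).head(k).index.tolist()

X_selected = X_demo[top_k]
print(top_k)
```

Fixing `random_state` matters here: random forest importances vary from run to run, so an unseeded forest can return a slightly different top-k list each time.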


Correlation Heatmap

plt.figure(figsize=(16, 6))
# define the mask to set the values in the upper triangle to True
mask = np.triu(np.ones_like(treino.corr(), dtype=bool))
heatmap = sns.heatmap(treino.corr(), mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':18}, pad=16);
(1006, 14)

Starting from the 14 top-ranked variables (keeping only Dia do Ano, without the other date derivations), we used forward selection to find the subset that gave the best result according to the ROC AUC metric. After identifying the best subset, we removed the variables that were correlated with another variable. Doing this with only Dia do Ano available to the random forest (the other date variations excluded), the best features for the Naive Bayes model were: ‘Dia do Ano’, ‘A_W/D %’, ‘A_FG%’, ‘H_MOV’, ‘H_eFG%’, ‘A_3P%’, ‘A_FT%’. This gave a Kaggle score of 0.727.

We then repeated the same procedure with the other date-derived variables, excluding Dia do Ano, and the best variables for the Naive Bayes model were: ‘Dia’, ‘weekday’, ‘weekofyear’, ‘H_eFG%’, ‘A_W/D %’, ‘A_SRS’. These variables gave our best Kaggle score: 0.729.

The rest of the notebook reproduces the results for the configuration that achieved our best test accuracy and our best Kaggle ranking.
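
To make the forward-selection idea concrete, here is a hand-rolled greedy loop scored by cross-validated ROC AUC. It is a sketch under stated assumptions: synthetic data, a `GaussianNB` estimator, and a hypothetical helper `forward_select`. It is not mlxtend's implementation and not the exact run reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption: 8 candidate features).
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, random_state=0)

def forward_select(X, y, estimator, max_features, cv=5):
    """Greedy forward selection scored by cross-validated ROC AUC."""
    selected, history = [], []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        # Score every remaining candidate when added to the current subset.
        scores = {
            j: cross_val_score(estimator, X[:, selected + [j]], y,
                               scoring="roc_auc", cv=cv).mean()
            for j in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        history.append((tuple(selected), scores[best]))
    return history

history = forward_select(X_demo, y_demo, GaussianNB(), max_features=5)
# Pick the subset size with the highest score, as done with the log in this notebook.
best_subset, best_score = max(history, key=lambda item: item[1])
print(best_subset, round(best_score, 3))
```

The score does not have to increase at every step, which is why the best subset is chosen from the whole history rather than taken from the last iteration.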

Standardization and train/test split

# The 14 best variables chosen by the random forest model without the Dia do Ano variable
#col = ['Dia', 'weekday', 'weekofyear', 'A_Loss', 'H_eFG%', 'H_MOV', 'A_W/D %', 'H_SRS', 'A_MOV', 'A_Wins', 'A_SRS', 'H_TS%', 'H_W/D %', 'H_Loss']

treino = X_completo[select_colunas]
teste = teste_completo
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

scaler_train = StandardScaler()
#scaler_train = MinMaxScaler()
X = scaler_train.fit_transform(treino)

# Standardize the test set as well
scaler_train = StandardScaler()
#scaler_train = MinMaxScaler()
teste = scaler_train.fit_transform(teste[select_colunas])
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y, test_size=0.25)

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import roc_auc_score

from mlxtend.feature_selection import SequentialFeatureSelector

feature_selector = SequentialFeatureSelector(RandomForestClassifier(n_jobs=-1),
           k_features = 14,
           forward = True,
           verbose = 2,
           scoring = 'roc_auc',
           cv = 5)
# the subset made up of 8 variables was the one chosen: score: 0.615483
features = feature_selector.fit(X_train, y_train)
[2021-10-09 17:31:45] Features: 1/14 -- score: 0.5970483694203371
[2021-10-09 17:31:58] Features: 2/14 -- score: 0.5919707650839727
[2021-10-09 17:32:10] Features: 3/14 -- score: 0.5784259204407453
[2021-10-09 17:32:23] Features: 4/14 -- score: 0.5912465566778236
[2021-10-09 17:32:33] Features: 5/14 -- score: 0.6063153490714137
[2021-10-09 17:32:42] Features: 6/14 -- score: 0.6132538283818607
[2021-10-09 17:32:51] Features: 7/14 -- score: 0.61611956103196
[2021-10-09 17:32:58] Features: 8/14 -- score: 0.6224219513639999
[2021-10-09 17:33:04] Features: 9/14 -- score: 0.6082610112259709
[2021-10-09 17:33:09] Features: 10/14 -- score: 0.6064708539438998
[2021-10-09 17:33:14] Features: 11/14 -- score: 0.5931933295814698
[2021-10-09 17:33:17] Features: 12/14 -- score: 0.5899106957732295
[2021-10-09 17:33:19] Features: 13/14 -- score: 0.5786436272622256
[2021-10-09 17:33:21] Features: 14/14 -- score: 0.5741880424158052
{1: {'feature_idx': (7,),
  'cv_scores': array([0.55236812, 0.67404698, 0.64054678, 0.5790335 , 0.53924647]),
  'avg_score': 0.5970483694203371,
  'feature_names': ('7',)},
 2: {'feature_idx': (7, 13),
  'cv_scores': array([0.54331921, 0.65941471, 0.64372353, 0.57787832, 0.53551805]),
  'avg_score': 0.5919707650839727,
  'feature_names': ('7', '13')},
 3: {'feature_idx': (1, 7, 13),
  'cv_scores': array([0.57691567, 0.5883712 , 0.61676935, 0.60271467, 0.50735871]),
  'avg_score': 0.5784259204407453,
  'feature_names': ('1', '7', '13')},
 4: {'feature_idx': (1, 7, 10, 13),
  'cv_scores': array([0.57653061, 0.55650751, 0.6380439 , 0.62726223, 0.55788854]),
  'avg_score': 0.5912465566778236,
  'feature_names': ('1', '7', '10', '13')},
 5: {'feature_idx': (1, 2, 7, 10, 13),
  'cv_scores': array([0.59655372, 0.53744705, 0.68569503, 0.61994609, 0.59193485]),
  'avg_score': 0.6063153490714137,
  'feature_names': ('1', '2', '7', '10', '13')},
 6: {'feature_idx': (1, 2, 3, 7, 10, 13),
  'cv_scores': array([0.62216018, 0.54688102, 0.68800539, 0.63573354, 0.57348901]),
  'avg_score': 0.6132538283818607,
  'feature_names': ('1', '2', '3', '7', '10', '13')},
 7: {'feature_idx': (1, 2, 3, 5, 7, 10, 13),
  'cv_scores': array([0.60117443, 0.59568733, 0.66316904, 0.6325568 , 0.5880102 ]),
  'avg_score': 0.61611956103196,
  'feature_names': ('1', '2', '3', '5', '7', '10', '13')},
 8: {'feature_idx': (0, 1, 2, 3, 5, 7, 10, 13),
  'cv_scores': array([0.60714286, 0.59145168, 0.66981132, 0.64882557, 0.59487834]),
  'avg_score': 0.6224219513639999,
  'feature_names': ('0', '1', '2', '3', '5', '7', '10', '13')},
 9: {'feature_idx': (0, 1, 2, 3, 4, 5, 7, 10, 13),
  'cv_scores': array([0.5732576 , 0.61532538, 0.6413169 , 0.62427801, 0.58712716]),
  'avg_score': 0.6082610112259709,
  'feature_names': ('0', '1', '2', '3', '4', '5', '7', '10', '13')},
 10: {'feature_idx': (0, 1, 2, 3, 4, 5, 7, 10, 11, 13),
  'cv_scores': array([0.57142857, 0.60329226, 0.63274933, 0.64276088, 0.58212323]),
  'avg_score': 0.6064708539438998,
  'feature_names': ('0', '1', '2', '3', '4', '5', '7', '10', '11', '13')},
 11: {'feature_idx': (0, 1, 2, 3, 4, 5, 7, 8, 10, 11, 13),
  'cv_scores': array([0.53475164, 0.60271467, 0.64767039, 0.61725067, 0.56357928]),
  'avg_score': 0.5931933295814698,
  'feature_names': ('0', '1', '2', '3', '4', '5', '7', '8', '10', '11', '13')},
 12: {'feature_idx': (0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 13),
  'cv_scores': array([0.55217559, 0.59520601, 0.63659992, 0.60964575, 0.55592622]),
  'avg_score': 0.5899106957732295,
  'feature_names': ('0',
 13: {'feature_idx': (0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13),
  'cv_scores': array([0.53696573, 0.59279938, 0.62283404, 0.60107817, 0.53954082]),
  'avg_score': 0.5786436272622256,
  'feature_names': ('0',
 14: {'feature_idx': (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13),
  'cv_scores': array([0.52050443, 0.59279938, 0.62023489, 0.60512129, 0.53228022]),
  'avg_score': 0.5741880424158052,
  'feature_names': ('0',
# The best subset was: ['Dia', 'weekday', 'weekofyear', 'A_Loss', 'H_eFG%', 'A_W/D %', 'A_Wins', 'A_SRS']
cols = ['Dia', 'weekday', 'weekofyear', 'A_Loss', 'H_eFG%', 'A_W/D %', 'A_Wins', 'A_SRS']
treino = X_completo
treino = treino[cols]
plt.figure(figsize=(16, 6))
# define the mask to set the values in the upper triangle to True
mask = np.triu(np.ones_like(treino.corr(), dtype=bool))
heatmap = sns.heatmap(treino.corr(), mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':18}, pad=16);

Based on the correlations, we chose to drop the variables ‘A_Loss’ and ‘A_Wins’.
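
Reading highly correlated pairs off a heatmap can also be done programmatically. The sketch below uses synthetic season records in which losses and win percentage are derived from wins over 82 games, so they are (nearly) perfectly correlated; the `threshold` value is a hypothetical cut-off, not one used in the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic season records (assumption: 82 games, as in the sample rows).
wins = np.array([52, 60, 41, 35, 58, 47])
demo = pd.DataFrame({
    "A_Wins": wins,
    "A_Loss": 82 - wins,
    "A_W/D %": np.round(wins / 82, 3),
    "weekday": [4, 0, 2, 4, 0, 3],
})

corr = demo.corr()
# Keep only the strict upper triangle so each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
threshold = 0.95  # hypothetical cut-off
pairs = [(a, b, round(upper.loc[a, b], 3))
         for a in upper.index for b in upper.columns
         if pd.notna(upper.loc[a, b]) and abs(upper.loc[a, b]) > threshold]
print(pairs)
```

In this toy frame the wins, losses and win-percentage columns carry the same information, which mirrors the reasoning for keeping ‘A_W/D %’ while dropping ‘A_Loss’ and ‘A_Wins’.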

col = ['Dia', 'weekday', 'weekofyear', 'H_eFG%','A_W/D %', 'A_SRS']
treino = X_completo[col]
teste = teste_completo
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

scaler_train = StandardScaler()
#scaler_train = MinMaxScaler()
X = scaler_train.fit_transform(treino)

# Standardize the test set as well
scaler_train = StandardScaler()
#scaler_train = MinMaxScaler()
teste = scaler_train.fit_transform(teste[col])

# training and validation split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y, test_size=0.25)
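
The cells above standardize the test set with a scaler fitted on the test set itself. A common alternative is to fit the scaler on the training data only and reuse its statistics for the test data. The sketch below contrasts the two on toy arrays; it is an illustration of the difference, not a claim about which variant produced the reported scores.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays standing in for the selected training and test columns.
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[10.0], [12.0]])

scaler = StandardScaler().fit(train)   # statistics come from the training data only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # same mean and scale reused

# For comparison: refitting on the test set recentres it around its own mean.
test_refit = StandardScaler().fit_transform(test)

print(test_scaled.ravel(), test_refit.ravel())
```

With the shared scaler, a test value keeps its position relative to the training distribution; with a refit, any shift between the two sets is erased before the model sees the data.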

Naive Bayes

from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import balanced_accuracy_score
from sklearn.naive_bayes import GaussianNB     # 1. choose model class
model_NB = GaussianNB()                        # 2. instantiate model
model_NB.fit(X_train, y_train)                 # 3. fit model to data
y_predNB = model_NB.predict(X_test)            # 4. predict on new data

# compute the metrics (note: the balanced-accuracy call passes y_predNB first,
# while scikit-learn's signature is (y_true, y_pred))

print('Balanced accuracy Naive Bayes: {:.3f}'.format(balanced_accuracy_score(y_predNB, y_test)))
print("F1 score Naive Bayes: {:.3f}".format(f1_score(y_test, y_predNB, average = "weighted")))
print("Precision Naive Bayes: {:.3f}".format(precision_score(y_test, y_predNB, average = "weighted")))
Balanced accuracy Naive Bayes: 0.632
F1 score Naive Bayes: 0.655
Precision Naive Bayes: 0.656
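
Balanced accuracy is the mean of the per-class recall, so it is not symmetric in its arguments. The toy example below (made-up labels, not the notebook's predictions) shows how swapping `y_true` and `y_pred` changes the value.

```python
from sklearn.metrics import balanced_accuracy_score

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]

# Recall is computed per true class, so argument order matters.
correct_order = balanced_accuracy_score(y_true, y_pred)   # (3/4 + 2/2) / 2
swapped_order = balanced_accuracy_score(y_pred, y_true)   # (3/3 + 2/3) / 2

print(round(correct_order, 3), round(swapped_order, 3))
```

With the arguments swapped, the value is effectively an average of per-class precision rather than recall, so the two numbers should not be compared directly.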
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(model_NB, X_train, y_train, cv=10)

print(cv_scores)
print("Mean cross-val accuracy: %f" % cv_scores.mean())
print("Variance: %f" % cv_scores.var())
[0.68421053 0.68421053 0.69736842 0.67105263 0.73333333 0.65333333
 0.72       0.66666667 0.62666667 0.65333333]
Mean cross-val accuracy: 0.679018
Variance: 0.000930
from sklearn.model_selection import cross_validate

#cv = cross_validate(model_NB, X_train, y_train, return_train_score=True)
cv = cross_validate(model_NB, X, y, return_train_score=True, cv=10)
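
The dictionary returned by `cross_validate` can be summarized by comparing the mean training and validation scores. The sketch below does this on synthetic data (an assumption standing in for `X` and `y`); the name `cv_demo` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption: 6 features, like the final subset).
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, random_state=0)

cv_demo = cross_validate(GaussianNB(), X_demo, y_demo,
                         return_train_score=True, cv=10)

# A large gap between train and validation accuracy would hint at overfitting.
print("train: %.3f" % cv_demo["train_score"].mean())
print("validation: %.3f" % cv_demo["test_score"].mean())
```

Despite its name, the `test_score` key holds the score on each held-out validation fold, not on the Kaggle test set.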



from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hyperparameters to optimize
C = np.arange(1,30)
gamma = ["scale", "auto"]
decision_function_shape = ["ovo", "ovr"]
k_fold = 10
# Grid search for the best combination of hyperparameter values,
#   with 10-fold cross-validation.
model_svm = GridSearchCV(SVC(), cv = k_fold,
                     param_grid={"C": C, "gamma": gamma, "decision_function_shape": decision_function_shape}).fit(X_train, y_train)
y_pred = model_svm.predict(X_test)

# Measure the quality of the fitted model
print("Balanced accuracy SVM: {:.3f}".format(balanced_accuracy_score(y_test, y_pred)))
print("F1 score SVM: {:.3f}".format(f1_score(y_test, y_pred, average = "weighted")))
print("Precision SVM: {:.3f}".format(precision_score(y_test, y_pred, average = "weighted")))
Balanced accuracy SVM: 0.574
F1 score SVM: 0.629
Precision SVM: 0.629
from sklearn.model_selection import cross_validate
cv = cross_validate(model_svm.best_estimator_, X, y, return_train_score=True, cv=10)
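
After a grid search it helps to inspect which combination won and how it scored. The sketch below runs a much smaller grid on synthetic data. The `scoring="balanced_accuracy"` argument is an assumption made here so that the search optimizes the same metric that is reported; the notebook's search uses the default scorer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data; the grid is far smaller than the notebook's to keep it quick.
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=3, random_state=0)

search = GridSearchCV(
    SVC(),
    param_grid={"C": [1, 10], "gamma": ["scale", "auto"]},
    scoring="balanced_accuracy",  # assumption: score the search with the reported metric
    cv=5,
).fit(X_demo, y_demo)

print(search.best_params_)          # winning hyperparameter combination
print(round(search.best_score_, 3))  # its mean cross-validated score
```

The sketch leaves `decision_function_shape` out of the grid: according to the scikit-learn documentation it is ignored for binary classification, so for a two-class target like `WinOrLose` it should only multiply the number of fits.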

Submitting NB

y_pred = model_NB.predict(teste)
y_pred = np.array(y_pred, dtype = int)

prediction = pd.DataFrame()
prediction['Game'] = Id
prediction['WinOrLose'] = y_pred

d = {1: 'W', 0: 'L'}
prediction['WinOrLose'] = prediction['WinOrLose'].replace(d)
Game WinOrLose
0 0 W
1 1 W
2 2 L
3 3 L
4 4 W
L    130
W     35
Name: WinOrLose, dtype: int64
y = prediction['WinOrLose'].value_counts()/prediction.WinOrLose.value_counts().sum()  # relative frequency
plt.bar(['L','W'], y)
plt.title('Relative frequency of wins and losses')
prediction.to_csv('NB.csv', index = False)

Kaggle score: 0.729

Submitting SVM

y_pred = model_svm.predict(teste)
y_pred = np.array(y_pred, dtype = int)

prediction = pd.DataFrame()
prediction['Game'] = Id
prediction['WinOrLose'] = y_pred

d = {1: 'W', 0: 'L'}
prediction['WinOrLose'] = prediction['WinOrLose'].replace(d)
Game WinOrLose
0 0 L
1 1 L
2 2 L
3 3 L
4 4 L
L    136
W     29
Name: WinOrLose, dtype: int64
y = prediction['WinOrLose'].value_counts()/prediction.WinOrLose.value_counts().sum()  # relative frequency
plt.bar(['L','W'], y)
plt.title('Relative frequency of wins and losses')
prediction.to_csv('SVM.csv', index = False)
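
Both submission cells follow the same pattern: decode the 0/1 predictions back to 'L'/'W' and pair them with the game ids. The sketch below shows that pattern on made-up values; `game_ids` and `encoded` are hypothetical stand-ins for `Id` and `y_pred`.

```python
import numpy as np
import pandas as pd

# Hypothetical encoded predictions and game ids.
game_ids = pd.Series([0, 1, 2, 3], name="Game")
encoded = np.array([1, 0, 0, 1])

# Decode with the same mapping used in the notebook (0 -> 'L', 1 -> 'W').
submission = pd.DataFrame({
    "Game": game_ids,
    "WinOrLose": pd.Series(encoded).map({0: "L", 1: "W"}),
})
print(submission)
print(submission["WinOrLose"].value_counts(normalize=True))
```

`value_counts(normalize=True)` gives the relative frequencies directly, which is what the bar charts in the submission cells plot.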